Research on the establishment of NDVI long-term data set based on a novel method

This study compares the NDVI (Normalized Difference Vegetation Index) products of three sensors: the NDVI of AVHRR (Advanced Very High Resolution Radiometer) (NDVIa), the NDVI of MODIS (Moderate Resolution Imaging Spectroradiometer) (NDVIm), and the NDVI of VIRR (Visible and Infrared Radiometer) (NDVIv). There is a significant correlation between NDVIa and NDVIm, and between NDVIv and NDVIa, and the three are ordered as NDVIv < NDVIa < NDVIm. Machine learning is an important method in artificial intelligence that can solve some complex problems through algorithms. This research uses the linear regression algorithm from machine learning to construct an NDVI correction method for the Fengyun satellite. With the linear regression model, the NDVI of the Fengyun satellite VIRR is corrected to a level that is basically the same as NDVIm. After correction, both the correlation coefficient and the coefficient of determination (R2) improved significantly, and all correlations were significant at the 0.01 level. The corrected Fengyun satellite NDVI therefore shows significantly improved accuracy and product quality when compared with the MODIS NDVI.


Scientific Reports | (2023) 13:9838 | https://doi.org/10.1038/s41598-023-36939-y | www.nature.com/scientificreports/

... requirements. Another advantage of publishing the plug-ins and the application code is that other users can improve the application. Feng Rui et al. 17 conducted a differential analysis of the NDVI of FY/MERSI (Medium Resolution Spectral Imager) and EOS/MODIS and found that the inversion results for various ground features showed good linear consistency. Yuan Zhengwu et al. 18 established a quantitative relationship between the vegetation indices of Landsat TM (Thematic Mapper) and HJCCD (Environmental satellite CCD camera), which provides a basis for the combined application of Landsat TM and HJCCD data. Xu Hanqiu et al. 19 analyzed the characteristics of the red and near-infrared bands of ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer) and Landsat ETM+ (Enhanced Thematic Mapper Plus) by comparing their vegetation indices. Wu Wenbin et al. 20 compared the Savitzky-Golay filter method and the asymmetric Gaussian function fitting method for fitting NDVI time series data. Li Jing et al. 21 used NDVI data of the Southwest Virginia coal field in the United States from 1984 to 2010 to compare the three filtering algorithms of TIMESAT3.1 (Time-series Satellite data Analysis Tool 3.1) for long-term remote sensing vegetation index data sets. Sha Sha et al. 22 took Maqu as an example to compare and analyze three long-term NDVI series: NDVI/MODIS, NDVI/GIMMS (Global Inventory Modeling and Mapping Studies) and NDVI/NSMC (National Satellite Meteorological Center). A vegetation index with a longer time series is a simple and effective parameter for dynamic monitoring, and it is very important for monitoring surface vegetation, ecological improvement, ecological evaluation and so on 12,17-22.
Machine learning is an important method in artificial intelligence. It can solve some complex problems through algorithms and has become one of the most popular subjects at the moment 23-26. The application of machine learning in remote sensing is generally divided into the following steps 27,28: collecting and cleaning data, building the model, selecting the correct algorithm, obtaining reliable results, and visualizing the data. In remote sensing, data are mainly collected by satellites or drones 29. Data cleaning is needed when the data set is incomplete or has missing values, and choosing an algorithm requires understanding the problem to be solved. If the model is only used for forecasting and highly reliable results are not required, the workflow ends here. However, for a research paper, or whenever highly credible results are required, a graphics library should be used to plot the results and derive the true solution from the chart data 30.
A training sample set was built based on the linear regression algorithm, combining the normalized vegetation index products retrieved by the Fengyun satellite and MODIS with observation parameters, surface type, ground elevation and meteorological factors. Through the machine learning model and these factors and parameters, the NDVI product retrieved by the Fengyun satellite is corrected to be basically consistent with the MODIS NDVI product, and a long-term normalized vegetation index is obtained.
Because of differences in sensitivity, resolution and observation method, different instruments report somewhat different NDVI values. Therefore, this study compares the NDVI of MODIS with the NDVI of VIRR and AVHRR respectively. We use statistical methods to compare and analyze the three normalized vegetation indices and to find the differences and correlations among the NDVI of VIRR, AVHRR and MODIS. Based on the machine learning algorithm, a Fengyun satellite NDVI correction algorithm is then constructed in order to form a long time series of the vegetation index.

Data and methods
Data. This paper selects parts of China and surrounding areas as the research area. The research data are the NDVI data of the MODIS sensors on Terra and Aqua (NDVIm), the NDVI data of the AVHRR sensor (NDVIa), and the NDVI data of the VIRR sensor on the Fengyun satellite (NDVIv) 31. The analysis proceeds in three steps: (I) compare NDVIv with NDVIa, and NDVIa with NDVIm; (II) from these comparisons, find the functional relationship between NDVIa and NDVIm and the functional relationship between NDVIv and NDVIa; (III) use NDVIa to correct the NDVIv data to a level equivalent to NDVIm.
The data used in this study (see Table 1) are NDVIa from 1982 to 2015, NDVIm from 2000 to 2019, and NDVIv from 2015 to 2020, all at a resolution of 0.05°. Both NDVIa and NDVIm are available for 2005, so we use that year's data to compare NDVIa with NDVIm and explore the correlation between the two. Both NDVIv and NDVIa are available for 2015, so we use that year's data to compare NDVIv with NDVIa and explore the correlation between the two. Finally, we compare the corrected NDVIv of 2019 with the NDVIm of 2019 to verify the model we constructed.

Figure 1 shows the spectral response function curves of the different satellite sensors in the visible and near-infrared spectrum 32. In the visible band, the spectral response function of MODIS is narrower than that of AVHRR, which in turn is narrower than that of VIRR. In the near-infrared band, MODIS still has the narrowest spectral response function, followed by VIRR, and AVHRR has the widest. The channel, wavelength range, corresponding spectrum and sub-satellite resolution of the MODIS, AVHRR, and VIRR sensors are shown in Table 2.

There are many forms of linear models, and linear regression is a common one. Linear regression tries to learn a linear model that predicts the real-valued output as accurately as possible. A linear model is established on the data set together with a loss function, and the model parameters are then determined by minimizing that loss function, which yields the model used for subsequent prediction. The general linear regression process is presented in Fig. 2. The detailed procedure is as follows 33:

(I) The data are standardized and preprocessed. Preprocessing includes data cleaning, screening and organization, so that the data can be input into the machine learning model as feature variables.
(II) Different machine learning algorithms are trained on separate data sets to find the best machine learning model, and a machine learning model is established based on the normalized vegetation index product retrieved by the Fengyun satellite.
(III) The long-term normalized vegetation index series of the Fengyun satellite is verified and output.
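The linear regression procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the synthetic arrays, the function names `fit_linear` and `mse`, and the choice of mean squared error as the loss function are all assumptions made for the example.

```python
import numpy as np

def fit_linear(x, y):
    """Ordinary least-squares fit of y = k * x + m; returns (k, m)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: one column for the slope, one column of ones for the intercept.
    A = np.column_stack([x, np.ones_like(x)])
    (k, m), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(k), float(m)

def mse(x, y, k, m):
    """Mean squared error of the line y = k * x + m (the loss being minimized)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((k * x + m - y) ** 2))

# Synthetic stand-in for paired NDVI samples (hypothetical values, not study data).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 0.9, size=500)
y = 1.1 * x - 0.03 + rng.normal(0.0, 0.02, size=500)

k, m = fit_linear(x, y)
loss = mse(x, y, k, m)
print(f"k={k:.4f}, m={m:.4f}, mse={loss:.6f}")
```

For a single predictor the least-squares solution has a closed form, so no iterative optimizer is needed; an iterative method would only matter for the larger feature sets mentioned in the text.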
Both AVHRR and MODIS NDVI data are available for 2001-2005, so we used the data of these 5 years to compare NDVIa and NDVIm and explore the correlation between the two. Both VIRR and AVHRR NDVI data are available for 2015, so we used the data of this year to compare NDVIv and NDVIa and explore the correlation between the two. Finally, we compared the corrected NDVIv of 2019 with the NDVIm of 2019 to verify the model we constructed.
The linear machine learning model is used to construct the optimal functional relationship between NDVIa and NDVIm, as presented in formula (1):

$$Y_{NDVIm} = k\,X_{NDVIa} + m \qquad (1)$$

In the formula, $X_{NDVIa}$ is the NDVI value of AVHRR, $Y_{NDVIm}$ is the NDVI value of MODIS, and k is the coefficient (slope) of the linear relationship between NDVIa and NDVIm; k2001, k2002, k2003, k2004, k2005, kmin, kmax and kave are the coefficients for 2001, 2002, 2003, 2004 and 2005, the 5-year minimum, the 5-year maximum, and the 5-year average respectively. m is the intercept of the linear relationship between NDVIa and NDVIm; m2001, m2002, m2003, m2004, m2005, mmin, mmax and mmean are the intercepts for 2001, 2002, 2003, 2004 and 2005, the 5-year minimum, the 5-year maximum, and the 5-year average respectively.
Through multiple cross-comparison analysis, the optimal coefficient k and the optimal intercept m are selected, and the optimal functional relationship between NDVIa and NDVIm is thereby determined.
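One way such a cross-comparison could be organized is sketched below. The text does not state the selection criterion, so this example assumes that each candidate pair (the five yearly fits plus the minimum, maximum and mean of the yearly values) is scored by its mean RMSE over all years and that the lowest score wins. The function names, the synthetic data and the RMSE criterion are hypothetical.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y = k * x + m; returns (k, m)."""
    k, m = np.polyfit(x, y, deg=1)
    return float(k), float(m)

def rmse(x, y, k, m):
    """Root mean squared error of the line y = k * x + m on (x, y)."""
    return float(np.sqrt(np.mean((k * x + m - y) ** 2)))

def select_best(yearly):
    """yearly maps year -> (x, y). Returns (label, k, m, score) of the best candidate."""
    fits = {str(year): fit_linear(x, y) for year, (x, y) in yearly.items()}
    ks = [k for k, _ in fits.values()]
    ms = [m for _, m in fits.values()]
    # Candidate set: each yearly fit plus min, max and mean of the yearly values.
    candidates = dict(fits)
    candidates["min"] = (min(ks), min(ms))
    candidates["max"] = (max(ks), max(ms))
    candidates["mean"] = (float(np.mean(ks)), float(np.mean(ms)))
    # Assumed criterion: mean RMSE of the candidate line across all years.
    scores = {
        label: float(np.mean([rmse(x, y, k, m) for x, y in yearly.values()]))
        for label, (k, m) in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, candidates[best][0], candidates[best][1], scores[best]

# Synthetic yearly samples (hypothetical; slopes and intercepts drift slightly by year).
rng = np.random.default_rng(1)
yearly = {}
for i, year in enumerate(range(2001, 2006)):
    x = rng.uniform(0.0, 0.9, size=300)
    noise = rng.normal(0.0, 0.02, size=300)
    y = (1.10 + 0.01 * (i - 2)) * x + (-0.03 + 0.002 * (i - 2)) + noise
    yearly[year] = (x, y)

best_label, best_k, best_m, best_score = select_best(yearly)
print(best_label, round(best_k, 4), round(best_m, 4), round(best_score, 4))
```

Scoring every candidate against all years, rather than only the year it was fitted on, is what makes this a cross-comparison; other criteria, such as the highest mean correlation, would slot into the same structure.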
Based on the above analysis, we continue to construct the functional relationship between NDVIa and NDVIv, according to formula (2).
In formula (2), $Z_{NDVIv}$ is the NDVI value of VIRR, $X_{NDVIa}$ is the NDVI value of AVHRR, a is the coefficient (slope) of the fitted linear relationship between NDVIv and NDVIa, and b is the intercept of that fitted relationship.
Substituting the functional relationship between NDVIa and NDVIv into the selected optimal functional relationship between NDVIa and NDVIm gives the refitted NDVIv, which is $C_{NDVIcv}$ in formula (3). The functional relationship of the simulated NDVIv is given in formula (3). In the formula, $C_{NDVIcv}$ is the corrected NDVIv (NDVIcv), k is the optimal coefficient of the relationship between NDVIa and NDVIm, and m is the optimal intercept of that relationship.
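The substitution described above can be written out from the symbol definitions. The following is one consistent reading, under two assumptions that the text as extracted here does not confirm: that formula (1) has the form $Y_{NDVIm} = k\,X_{NDVIa} + m$, and that formula (2) expresses NDVIa as a linear function of NDVIv so that it can be substituted directly.

```latex
\begin{align}
X_{NDVIa} &= a\,Z_{NDVIv} + b \tag{2} \\
C_{NDVIcv} &= k\,X_{NDVIa} + m
            = k\,(a\,Z_{NDVIv} + b) + m
            = ka\,Z_{NDVIv} + (kb + m) \tag{3}
\end{align}
```

Under this reading the corrected value is itself a linear function of NDVIv, with slope $ka$ and intercept $kb + m$. If formula (2) were instead fitted in the opposite direction, $Z_{NDVIv} = a\,X_{NDVIa} + b$, it would first have to be inverted to $X_{NDVIa} = (Z_{NDVIv} - b)/a$ before substitution.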
The data of 2005 were selected to compare NDVIm and NDVIa in some parts of China and surrounding areas. The data of 2015 were selected to compare NDVIv and NDVIa in some parts of China and surrounding areas. Through analysis, the correlation among NDVIv, NDVIa and NDVIm is found.

Comparison of NDVI between AVHRR and MODIS in parts of China and surrounding areas. By comparing NDVIa and NDVIm in parts of China and surrounding areas (Table 3), we found that the correlation coefficient for January, April, July, and October of 2005 was between 0.8652 and 0.9348, and the coefficient of determination was between 0.7024 and 0.8519. The p-values are significant at the 0.01 or 0.05 level, indicating that NDVIa and NDVIm are well correlated. Overall, in 2005 NDVIa was smaller than NDVIm in every compared month except January, when NDVIa was larger than NDVIm. As can be seen from Fig. 3, in April and October NDVIa is greater than NDVIm in areas with lower NDVI values.

Comparison of NDVI between VIRR and AVHRR in parts of China and surrounding areas. By comparing NDVIv and NDVIa in parts of China and surrounding areas (Table 4), we found that the correlation coefficient for January, April, July, and October of 2015 was between 0.7238 and 0.8929, and the coefficient of determination was between 0.6072 and 0.8299. The p-values are significant at the 0.01 or 0.05 level, indicating a significant correlation between NDVIv and NDVIa, so NDVIa can be used to correct NDVIv. On the whole, NDVIv in January, April, July, and October of 2015 is smaller than NDVIa. However, Fig. 4 shows that in April NDVIv is greater than NDVIa in areas with lower NDVI values.
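The two statistics reported in these comparisons can be computed as in the sketch below. It is only an illustration on synthetic arrays: the function names and data are hypothetical, and the significance test (the p-value) is left out to keep the example dependent on NumPy alone.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two samples."""
    return float(np.corrcoef(x, y)[0, 1])

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic paired samples (hypothetical): VIRR-like values sit below AVHRR-like values.
rng = np.random.default_rng(7)
ndvi_a = rng.uniform(0.05, 0.85, size=400)
ndvi_v = 0.9 * ndvi_a - 0.01 + rng.normal(0.0, 0.05, size=400)

# Fit NDVIa as a linear function of NDVIv, then score the fitted line.
a, b = np.polyfit(ndvi_v, ndvi_a, deg=1)
r = pearson_r(ndvi_v, ndvi_a)
r2 = r_squared(ndvi_a, a * ndvi_v + b)
print(round(r, 4), round(r2, 4))
```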
To widen the choice of linear models and improve the accuracy of the correction, we compared NDVIa and NDVIm for January, April, July, and October of each year from 2001 to 2005. The specific analysis and comparison results are as follows.
By comparing NDVIa and NDVIm in January (Table 5), it is found that from 2001 to 2005 the coefficient of the relationship between NDVIa and NDVIm was 1.0358-1.0679 (average 1.0553) and the intercept was −0.0564 to −0.0285 (average −0.0409). The correlation coefficient r was 0.8638-0.8768 (average 0.8693), the coefficient of determination R2 was 0.7024-0.7662 (average 0.7400), and the p-value was 0.0133-0.0175 (average 0.0149), significant at the 0.05 level in every year.
By comparing NDVIa and NDVIm in April (Table 6), it is found that from 2001 to 2005 the coefficient was 1.1504-1.1823 (average 1.1637) and the intercept was −0.0415 to −0.0272 (average −0.0332). The correlation coefficient r was 0.9070-0.9137 (average 0.9102), the coefficient of determination R2 was 0.8070-0.8362 (average 0.8184), and the p-value was 0.0069-0.0081 (average 0.0076), significant at the 0.01 level in every year.
By comparing NDVIa and NDVIm in July (Table 7), it is found that from 2001 to 2005 the coefficient was 1.0928-1.1191 (average 1.1026) and the intercept was 0.0229-0.0382 (average 0.0301). The correlation coefficient r was 0.9341-0.9395 (average 0.9370), the coefficient of determination R2 was 0.7741-0.8008 (average 0.7870), and the p-value was 0.0149-0.0173 (average 0.0160), significant at the 0.05 level in every year.
By comparing NDVIa and NDVIm in October (Table 8), it is found that from 2001 to 2005 the coefficient was 1.1349-1.1809 (average 1.1523) and the intercept was −0.0330 to −0.0113 (average −0.0189). The correlation coefficient r was 0.8903-0.9048, the average correlation [...] Table 9.
After the NDVIv is corrected and compared with the NDVIm (Table 9), the correlation coefficient between NDVIv and NDVIm rises from 0.7238-0.8929 before correction to 0.9126-0.9445 after correction. The coefficient of determination rises from 0.6072-0.8299 to 0.9002-0.9326, and the significance level improves from the original 0.05 or 0.01 level to at least the 0.01 level, in some cases reaching the 0.001 level. Fig. 5 also shows that the corrected NDVIv is substantially more consistent with the NDVIm. This demonstrates that our correction method is feasible.
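A before-and-after check of this kind can be sketched on synthetic data as follows. The example assumes that the correction chains two fitted lines (NDVIv to an NDVIa-equivalent value, then to an NDVIm-equivalent value) and reports RMSE and mean bias rather than the statistics in Table 9. The function `correct_ndviv`, the arrays and every coefficient are hypothetical and are not taken from the study.

```python
import numpy as np

def correct_ndviv(z, a, b, k, m):
    """Chain two linear maps: NDVIv -> NDVIa-equivalent -> NDVIm-equivalent."""
    z = np.asarray(z, dtype=float)
    return k * (a * z + b) + m

def rmse(p, q):
    """Root mean squared difference between two samples."""
    return float(np.sqrt(np.mean((np.asarray(p) - np.asarray(q)) ** 2)))

# Synthetic reference and sensor values built so that NDVIv < NDVIa < NDVIm on average.
rng = np.random.default_rng(42)
n = 1000
ndvi_m = rng.uniform(0.1, 0.9, size=n)
ndvi_a = (ndvi_m + 0.03) / 1.10 + rng.normal(0.0, 0.01, size=n)
ndvi_v = (ndvi_a - 0.02) / 1.15 + rng.normal(0.0, 0.01, size=n)

# Fit each step separately, mirroring the two-stage construction in the text.
a, b = np.polyfit(ndvi_v, ndvi_a, deg=1)
k, m = np.polyfit(ndvi_a, ndvi_m, deg=1)
ndvi_cv = correct_ndviv(ndvi_v, a, b, k, m)

rmse_before = rmse(ndvi_v, ndvi_m)
rmse_after = rmse(ndvi_cv, ndvi_m)
bias_before = float(np.mean(ndvi_v - ndvi_m))
bias_after = float(np.mean(ndvi_cv - ndvi_m))
print(round(rmse_before, 4), round(rmse_after, 4))
print(round(bias_before, 4), round(bias_after, 4))
```

In practice the two fits would come from the overlap years and the check would use an independent year, as the study does with 2019; here all three steps share one synthetic sample for brevity.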

Conclusion and discussion
Through research, we found that there is a significant correlation between the NDVIa and the NDVIm, and there is also a significant correlation between the NDVIv and the NDVIa. The relationship between the three is NDVIv < NDVIa < NDVIm.
Using the constructed linear model in machine learning, the NDVIv was corrected, and the corrected values agree well with the NDVIm. The correlation coefficient improves from 0.7238-0.8929 before correction to 0.9126-0.9445 after correction, and the coefficient of determination improves from 0.6072-0.8299 to 0.9002-0.9326; all correlations are significant at the 0.01 level. The corrected NDVIv therefore shows significantly improved accuracy and product quality when compared with the NDVIm.
In addition, we have the following thoughts. Firstly, in this study, we use linear machine learning models to correct the NDVIv in parts of China and surrounding areas. In some areas, the NDVIv and the NDVIm may not have a linear relationship, and it is likely to be a non-linear relationship, so in future research, different machine learning models should be used to correct Fengyun satellite products, such as decision trees, neural networks, and support vector machines. At the same time, we should construct regional machine learning models for parts of China and surrounding areas according to different terrains, different meteorological conditions, and different atmospheric conditions, and use different machine learning methods for different regions to correct Fengyun Satellite products. This may be able to more objectively and accurately correct the NDVIv value to a level equivalent to that of NDVIm.
Secondly, we are currently correcting the Fengyun satellite data from the product level. Although the corrected product is closer to the MODIS product value, it is a physical correction, and the mechanism is not very strong. In the future, we will consider revising the Fengyun Satellite's NDVI products from the near-infrared and infrared bands, compare and analyze the Fengyun Satellite's infrared band with the MODIS infrared band, and conduct a comparative analysis on the Fengyun Satellite's near-infrared band and MODIS's near-infrared band. It is possible to correct the Fengyun Satellite's NDVI products to a level closer to that of NDVIm. At the same time, it is also possible to find out the reasons for the inaccuracy of the Fengyun Satellite inversion values, thereby improving the Fengyun Satellite's own monitoring accuracy.
Thirdly, studies have shown that there is still a certain gap between NDVIm or NDVIa and the NDVI values observed on the ground 34,35. Therefore, it is not fully reasonable to treat NDVIm as the true value of NDVI. The Fengyun satellite NDVI product should also be compared with NDVI values retrieved from ground observation data, which would be a meaningful way to improve the accuracy of the Fengyun satellite NDVI product.
Fengyun satellite products are currently not widely applied, and one of the most important reasons is that their observed values differ greatly from the actual values. The NDVI observed by MODIS is currently internationally recognized as a relatively accurate observation. Therefore, one of the

Data availability
The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.
Received: 8 August 2022; Accepted: 13 June 2023

Table 9. The optimal function construction of NDVIv fitting and its comparison with NDVIm. * denotes significance at the 0.05 level, ** at the 0.01 level, and *** at the 0.001 level.